Abstract: Medical databases have accumulated large amounts of information about patients and their medical conditions, and the relationships and patterns within this data can yield new medical knowledge. The large volume of Electronic Health Records (EHRs) collected over the years provides a rich base for risk analysis and prediction. An EHR contains digitally stored healthcare information about an individual, such as observations, laboratory tests, diagnostic reports, medications, patient-identifying information, and allergies. A special type of EHR is the Health Examination Record (HER) from annual general health check-ups. The fundamental challenge in learning a classification model for risk prediction lies in the unlabelled data that makes up most of the collected dataset. In particular, the unlabelled data describes health examination participants whose health conditions can vary greatly, from healthy to very ill, and there is no ground truth for differentiating their states of health. Identifying participants at risk based on their current and past HERs is important for early warning and preventive intervention, where risk means unwanted outcomes such as mortality and morbidity. The proposed system presents a semi-supervised learning algorithm to handle a challenging multi-class classification problem with a substantial number of unlabelled cases. The algorithm constructs a training set from diabetes records with unlabelled classes and performs risk analysis with user query reports. The process offers a new way of predicting risks for participants based on their annual health examinations.
Keywords: Medical database, association rule mining, semi-supervised learning.
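
The semi-supervised setting summarized in the abstract can be illustrated with a minimal sketch. It is not the proposed algorithm itself. It uses plain self-training with a nearest-centroid classifier as a stand-in technique: a few labelled records seed the model, and unlabelled records are added to the training set only when the prediction is confident. The feature pair (fasting glucose, BMI), the three risk labels, the `min_margin` threshold, and all data values are hypothetical and chosen only for illustration.

```python
from math import dist

def centroids(X, y):
    """Mean feature vector per class label."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(cents, x):
    """Return (label, margin): nearest class and its gap to the runner-up."""
    ranked = sorted((dist(x, c), label) for label, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin

def self_train(X_lab, y_lab, X_unlab, min_margin=1.0, max_rounds=10):
    """Grow the training set with confidently pseudo-labelled records."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X, y)
        confident, rest = [], []
        for x in pool:
            label, margin = predict(cents, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:
            break  # nothing left that the model is sure about
        for x, label in confident:
            X.append(x)
            y.append(label)
        pool = rest
    return centroids(X, y), pool

# Toy, invented records: [fasting glucose (mmol/L), BMI]
X_lab = [[5.0, 22.0], [6.3, 27.0], [8.5, 33.0]]
y_lab = ["low", "medium", "high"]
X_unlab = [[5.2, 23.0], [8.1, 32.0], [6.4, 26.5], [7.4, 30.0]]

model, leftover = self_train(X_lab, y_lab, X_unlab)
print(predict(model, [5.1, 22.0])[0])  # risk class for a new query record
print(leftover)                        # records that stayed unlabelled
```

Records that fall between two classes remain in `leftover` rather than being forced into a class, which reflects the absence of ground truth for many participants. A real system would need validated features, feature scaling, and a clinically justified confidence criterion.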